🧠 Compositional Reasoning Evaluation Report

Evaluates the model's ability to handle nested function compositions at five depth levels.

Overall Accuracy: 54.0% (270/500 Correct)
Levels Tested: 5 Composition Depths
Total Samples: 500 Test Cases

📊 Accuracy by Composition Depth

Level    Functions Nested  Samples  Correct  Accuracy  Answer Rate
Level 1  1 function        100      75       75.0%     100.0%
Level 2  2 functions       100      77       77.0%     98.0%
Level 3  3 functions       100      58       58.0%     94.0%
Level 4  4 functions       100      39       39.0%     84.0%
Level 5  5 functions       100      21       21.0%     68.0%
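The figures in the table above can be recomputed from the raw counts. The following is a minimal sketch, not the evaluation harness itself: the names `LEVELS`, `pct`, and `summarize` are hypothetical, and the answered counts are derived from the Answer Rate column rather than read from the report directly.

```python
# Per-level counts taken from the table above: (samples, correct, answered).
# Assumption: answered = Answer Rate x Samples, since every level has 100 samples.
LEVELS = {
    1: (100, 75, 100),
    2: (100, 77, 98),
    3: (100, 58, 94),
    4: (100, 39, 84),
    5: (100, 21, 68),
}


def pct(numerator, denominator):
    """Return a percentage rounded to one decimal place."""
    return round(100.0 * numerator / denominator, 1)


def summarize(levels):
    """Return per-level accuracy and answer rate plus the overall accuracy."""
    rows = {}
    for depth, (samples, correct, answered) in levels.items():
        rows[depth] = {
            "accuracy": pct(correct, samples),
            "answer_rate": pct(answered, samples),
        }
    total_samples = sum(samples for samples, _, _ in levels.values())
    total_correct = sum(correct for _, correct, _ in levels.values())
    return rows, pct(total_correct, total_samples)


rows, overall = summarize(LEVELS)
print(overall)  # 54.0, i.e. 270 correct out of 500
```

Summing the Correct column gives 270, which matches the overall figure reported at the top.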

📊 Level 1 - 1 Function Nested

Accuracy: 75.0% (75/100 Correct)
Answer Rate: 100.0% (100/100 Answered)
Total Samples: 100 Test Cases

📈 Match Type Distribution

Match Type  Count  Percentage
exact       75     75.0%
mismatch    25     25.0%

🔍 Failure Analysis

Failure Type         Count  Share of Failures
No Answer Generated  0      0.0%
Wrong Answer         25     100.0%

Common Failure Patterns:

  • Wrong Answers (25): The model completed execution but computed an incorrect result
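The match types used throughout this report can be illustrated with a small classifier. This is a sketch under stated assumptions: the function name `match_type` is hypothetical, and it assumes the comparison is plain string equality with no whitespace normalization, which is consistent with samples such as comp_d1_s74 below being scored as a mismatch because of a stray space.

```python
def match_type(expected, predicted):
    """Classify one sample as 'exact', 'mismatch', or 'no_answer'.

    Assumption: an empty or missing prediction counts as 'no_answer',
    and anything else is compared to the expected string verbatim.
    """
    if predicted is None or predicted == "":
        return "no_answer"
    if predicted == expected:
        return "exact"
    return "mismatch"


# Examples taken from the Level 1 sample details.
print(match_type("tgtrxck", "tgtrxck"))                         # exact
print(match_type("egn", "egen"))                                # mismatch
print(match_type("cfdgdbupvvpubdgdfc", "cfdgdbupvvpubdgd fc"))  # mismatch
```

Under this rule a prediction that contains extra code, such as the one for comp_d1_s17, still counts as answered and is scored as a mismatch.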

📋 Sample Details

comp_d1_s0 ✓ CORRECT exact

Expected: tgtrxck | Predicted: tgtrxck

comp_d1_s1 ✓ CORRECT exact

Expected: fnaf | Predicted: fnaf

comp_d1_s10 ✓ CORRECT exact

Expected: cbcqn | Predicted: cbcqn

comp_d1_s11 ✓ CORRECT exact

Expected: kkppmmgg | Predicted: kkppmmgg

comp_d1_s12 ✓ CORRECT exact

Expected: rrccxx | Predicted: rrccxx

comp_d1_s14 ✓ CORRECT exact

Expected: hqpdounai | Predicted: hqpdounai

comp_d1_s15 ✓ CORRECT exact

Expected: ywvh | Predicted: ywvh

comp_d1_s16 ✓ CORRECT exact

Expected: wycmbt | Predicted: wycmbt

comp_d1_s19 ✓ CORRECT exact

Expected: pasbba | Predicted: pasbba

comp_d1_s20 ✓ CORRECT exact

Expected: moswoh | Predicted: moswoh

comp_d1_s23 ✓ CORRECT exact

Expected: vwczyw | Predicted: vwczyw

comp_d1_s24 ✓ CORRECT exact

Expected: eakmfk | Predicted: eakmfk

comp_d1_s25 ✓ CORRECT exact

Expected: kyz4s2h | Predicted: kyz4s2h

comp_d1_s26 ✓ CORRECT exact

Expected: tbp | Predicted: tbp

comp_d1_s27 ✓ CORRECT exact

Expected: bbnnwwbbbb | Predicted: bbnnwwbbbb

comp_d1_s3 ✓ CORRECT exact

Expected: jigj | Predicted: jigj

comp_d1_s30 ✓ CORRECT exact

Expected: bfgy | Predicted: bfgy

comp_d1_s31 ✓ CORRECT exact

Expected: jepzravhr | Predicted: jepzravhr

comp_d1_s32 ✓ CORRECT exact

Expected: hfirwngdco | Predicted: hfirwngdco

comp_d1_s33 ✓ CORRECT exact

Expected: qvbi | Predicted: qvbi

comp_d1_s34 ✓ CORRECT exact

Expected: cyrzels | Predicted: cyrzels

comp_d1_s35 ✓ CORRECT exact

Expected: ixuajol | Predicted: ixuajol

comp_d1_s36 ✓ CORRECT exact

Expected: fgcuwkqe | Predicted: fgcuwkqe

comp_d1_s37 ✓ CORRECT exact

Expected: zvwattawvz | Predicted: zvwattawvz

comp_d1_s38 ✓ CORRECT exact

Expected: cwwfgu | Predicted: cwwfgu

comp_d1_s4 ✓ CORRECT exact

Expected: nzjovqwpsb | Predicted: nzjovqwpsb

comp_d1_s41 ✓ CORRECT exact

Expected: cls | Predicted: cls

comp_d1_s42 ✓ CORRECT exact

Expected: gsko | Predicted: gsko

comp_d1_s43 ✓ CORRECT exact

Expected: jbriiw | Predicted: jbriiw

comp_d1_s45 ✓ CORRECT exact

Expected: lprmlfokvh | Predicted: lprmlfokvh

comp_d1_s46 ✓ CORRECT exact

Expected: rhrqc | Predicted: rhrqc

comp_d1_s48 ✓ CORRECT exact

Expected: ssjjkkddrrxx | Predicted: ssjjkkddrrxx

comp_d1_s49 ✓ CORRECT exact

Expected: fusbgwlurzpyx | Predicted: fusbgwlurzpyx

comp_d1_s5 ✓ CORRECT exact

Expected: zwwjlveevljwwz | Predicted: zwwjlveevljwwz

comp_d1_s50 ✓ CORRECT exact

Expected: voxtmgqt | Predicted: voxtmgqt

comp_d1_s51 ✓ CORRECT exact

Expected: qqhhcchhhhiiooddggss | Predicted: qqhhcchhhhiiooddggss

comp_d1_s54 ✓ CORRECT exact

Expected: mmhhmmkkrruu | Predicted: mmhhmmkkrruu

comp_d1_s55 ✓ CORRECT exact

Expected: prclwegilxzn | Predicted: prclwegilxzn

comp_d1_s56 ✓ CORRECT exact

Expected: gcaoztv | Predicted: gcaoztv

comp_d1_s57 ✓ CORRECT exact

Expected: qtyu | Predicted: qtyu

comp_d1_s58 ✓ CORRECT exact

Expected: qgmmgq | Predicted: qgmmgq

comp_d1_s59 ✓ CORRECT exact

Expected: ymgjueip | Predicted: ymgjueip

comp_d1_s6 ✓ CORRECT exact

Expected: rbqngms | Predicted: rbqngms

comp_d1_s60 ✓ CORRECT exact

Expected: bckpqrsuw | Predicted: bckpqrsuw

comp_d1_s61 ✓ CORRECT exact

Expected: ujljakmljxld | Predicted: ujljakmljxld

comp_d1_s62 ✓ CORRECT exact

Expected: mvpigyhltonqqx | Predicted: mvpigyhltonqqx

comp_d1_s63 ✓ CORRECT exact

Expected: aawweeffuuiiwwss | Predicted: aawweeffuuiiwwss

comp_d1_s65 ✓ CORRECT exact

Expected: ooggmmssllaahhkkggrr | Predicted: ooggmmssllaahhkkggrr

comp_d1_s66 ✓ CORRECT exact

Expected: ishgdflt | Predicted: ishgdflt

comp_d1_s67 ✓ CORRECT exact

Expected: ntefhptpzvhjhs | Predicted: ntefhptpzvhjhs

comp_d1_s68 ✓ CORRECT exact

Expected: qtcahbbh | Predicted: qtcahbbh

comp_d1_s69 ✓ CORRECT exact

Expected: 3zgwcbmprd | Predicted: 3zgwcbmprd

comp_d1_s7 ✓ CORRECT exact

Expected: afiruv | Predicted: afiruv

comp_d1_s70 ✓ CORRECT exact

Expected: cgokpqu | Predicted: cgokpqu

comp_d1_s72 ✓ CORRECT exact

Expected: abee | Predicted: abee

comp_d1_s73 ✓ CORRECT exact

Expected: hgsslry5qr | Predicted: hgsslry5qr

comp_d1_s76 ✓ CORRECT exact

Expected: cvz | Predicted: cvz

comp_d1_s77 ✓ CORRECT exact

Expected: zrh | Predicted: zrh

comp_d1_s78 ✓ CORRECT exact

Expected: mqxep | Predicted: mqxep

comp_d1_s79 ✓ CORRECT exact

Expected: mk3c | Predicted: mk3c

comp_d1_s8 ✓ CORRECT exact

Expected: fiwy | Predicted: fiwy

comp_d1_s80 ✓ CORRECT exact

Expected: voliizknzkmr | Predicted: voliizknzkmr

comp_d1_s81 ✓ CORRECT exact

Expected: kkeejjttqqttffooyyll | Predicted: kkeejjttqqttffooyyll

comp_d1_s82 ✓ CORRECT exact

Expected: 1q5s | Predicted: 1q5s

comp_d1_s83 ✓ CORRECT exact

Expected: coz | Predicted: coz

comp_d1_s84 ✓ CORRECT exact

Expected: yqenhvzgoj | Predicted: yqenhvzgoj

comp_d1_s85 ✓ CORRECT exact

Expected: whuky | Predicted: whuky

comp_d1_s88 ✓ CORRECT exact

Expected: tsvkrvzm | Predicted: tsvkrvzm

comp_d1_s89 ✓ CORRECT exact

Expected: npwwl | Predicted: npwwl

comp_d1_s91 ✓ CORRECT exact

Expected: moqgrhyqx | Predicted: moqgrhyqx

comp_d1_s92 ✓ CORRECT exact

Expected: lcqwwqcl | Predicted: lcqwwqcl

comp_d1_s94 ✓ CORRECT exact

Expected: reegk | Predicted: reegk

comp_d1_s95 ✓ CORRECT exact

Expected: qgxdlgau | Predicted: qgxdlgau

comp_d1_s98 ✓ CORRECT exact

Expected: aawwuuzzxxttvv | Predicted: aawwuuzzxxttvv

comp_d1_s99 ✓ CORRECT exact

Expected: jpqqrs | Predicted: jpqqrs

comp_d1_s13 ✗ INCORRECT mismatch

Expected: cegllnwy | Predicted: cceeeglnwy

comp_d1_s17 ✗ INCORRECT mismatch

Expected: gwlfory | Predicted: assert main_solution("gwlfo") == 'gwlfo' + 'ry

comp_d1_s18 ✗ INCORRECT mismatch

Expected: krjxlgbbglxjrk | Predicted: krjxlgbglxjrkk

comp_d1_s2 ✗ INCORRECT mismatch

Expected: ffppvvaauussiieeyy | Predicted: ffppvvaausssiieeyy

comp_d1_s21 ✗ INCORRECT mismatch

Expected: uecllmsrzcczrsmllceu | Predicted: uecllmsrzczcmsrllcue

comp_d1_s22 ✗ INCORRECT mismatch

Expected: uucceennddeemmww | Predicted: uuccceeddmmww

comp_d1_s28 ✗ INCORRECT mismatch

Expected: dxgpqkpnyynpkqpgxd | Predicted: dxgpqkpnyynpkpqpgxd

comp_d1_s29 ✗ INCORRECT mismatch

Expected: gkrhitbllbtihrkg | Predicted: gkrhitblblthirkg

comp_d1_s39 ✗ INCORRECT mismatch

Expected: h5f1f3wz | Predicted: h51113wz

comp_d1_s40 ✗ INCORRECT mismatch

Expected: ceknqyzz | Predicted: ceeknnqz

comp_d1_s44 ✗ INCORRECT mismatch

Expected: eijorrs | Predicted: eijorsr

comp_d1_s47 ✗ INCORRECT mismatch

Expected: ffagmwbbwmgaff | Predicted: ffagmwbwmgaff

comp_d1_s52 ✗ INCORRECT mismatch

Expected: kimbbmik | Predicted: kimbmbik

comp_d1_s53 ✗ INCORRECT mismatch

Expected: xxppsswwttjjaajjtt | Predicted: xxppsswwttjjajjtt

comp_d1_s64 ✗ INCORRECT mismatch

Expected: egn | Predicted: egen

comp_d1_s71 ✗ INCORRECT mismatch

Expected: xxhhpplloobbyyccttqq | Predicted: xxhhppllccbbyyccqq

comp_d1_s74 ✗ INCORRECT mismatch

Expected: cfdgdbupvvpubdgdfc | Predicted: cfdgdbupvvpubdgd fc

comp_d1_s75 ✗ INCORRECT mismatch

Expected: bcdilmnrsy | Predicted: bcddilmnrsy

comp_d1_s86 ✗ INCORRECT mismatch

Expected: vvhhhhppzzmmdd | Predicted: vvhhhhppzzmmd d

comp_d1_s87 ✗ INCORRECT mismatch

Expected: iioooolliieessoogguu | Predicted: iioooollieessoogguu

comp_d1_s9 ✗ INCORRECT mismatch

Expected: xqddt4tlp | Predicted: xqdd4t2tlp

comp_d1_s90 ✗ INCORRECT mismatch

Expected: aaddwwzzbbuuggqqkkcc | Predicted: aadwzzbbuuggqqkkcc

comp_d1_s93 ✗ INCORRECT mismatch

Expected: sxb5h | Predicted: sxb4h

comp_d1_s96 ✗ INCORRECT mismatch

Expected: bhyzz | Predicted: bhhyyz

comp_d1_s97 ✗ INCORRECT mismatch

Expected: fghkmoovzz | Predicted: fgghkmooovz

📊 Level 2 - 2 Functions Nested

Accuracy: 77.0% (77/100 Correct)
Answer Rate: 98.0% (98/100 Answered)
Total Samples: 100 Test Cases

📈 Match Type Distribution

Match Type  Count  Percentage
exact       77     77.0%
mismatch    21     21.0%
no_answer   2      2.0%

🔍 Failure Analysis

Failure Type         Count  Share of Failures
No Answer Generated  2      8.7%
Wrong Answer         21     91.3%

Common Failure Patterns:

  • Wrong Answers (21): The model completed execution but computed an incorrect result
  • Missing Answer Tags (2): The model completed but did not produce an [ANSWER] tag
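The percentages in the Failure Analysis table are relative to the number of failed samples, not to all 100 samples. A minimal sketch of that arithmetic for Level 2 follows; the function name `failure_shares` is hypothetical.

```python
def failure_shares(no_answer, wrong_answer):
    """Return each failure type as a percentage of all failures."""
    failures = no_answer + wrong_answer
    if failures == 0:
        # No failures at all: both shares are reported as zero.
        return {"no_answer": 0.0, "wrong_answer": 0.0}
    return {
        "no_answer": round(100.0 * no_answer / failures, 1),
        "wrong_answer": round(100.0 * wrong_answer / failures, 1),
    }


# Level 2: 2 samples without an answer and 21 wrong answers, 23 failures in total.
level2 = failure_shares(no_answer=2, wrong_answer=21)
print(level2)  # {'no_answer': 8.7, 'wrong_answer': 91.3}
```

The same 2 unanswered samples appear as 2.0% in the Match Type Distribution, because that table divides by all 100 samples.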

📋 Sample Details

comp_d2_s1 ✓ CORRECT exact

Expected: gmtuv | Predicted: gmtuv

comp_d2_s10 ✓ CORRECT exact

Expected: mntudk | Predicted: mntudk

comp_d2_s11 ✓ CORRECT exact

Expected: fpsxlzbmkrm | Predicted: fpsxlzbmkrm

comp_d2_s13 ✓ CORRECT exact

Expected: clnoruwx | Predicted: clnoruwx

comp_d2_s14 ✓ CORRECT exact

Expected: llrrjjddxxbbvvkkcc | Predicted: llrrjjddxxbbvvkkcc

comp_d2_s16 ✓ CORRECT exact

Expected: ppjjppffrrffcc | Predicted: ppjjppffrrffcc

comp_d2_s17 ✓ CORRECT exact

Expected: uvzvpqo | Predicted: uvzvpqo

comp_d2_s18 ✓ CORRECT exact

Expected: ahtxjed | Predicted: ahtxjed

comp_d2_s19 ✓ CORRECT exact

Expected: ttzzttffkkii | Predicted: ttzzttffkkii

comp_d2_s21 ✓ CORRECT exact

Expected: kkffnn33ccrrppnn33ff | Predicted: kkffnn33ccrrppnn33ff

comp_d2_s22 ✓ CORRECT exact

Expected: ytthdmllmdhtty | Predicted: ytthdmllmdhtty

comp_d2_s23 ✓ CORRECT exact

Expected: kkttllllsnvu | Predicted: kkttllllsnvu

comp_d2_s24 ✓ CORRECT exact

Expected: oskykze | Predicted: oskykze

comp_d2_s25 ✓ CORRECT exact

Expected: gpnk3332v | Predicted: gpnk3332v

comp_d2_s26 ✓ CORRECT exact

Expected: fujyggddffpptt | Predicted: fujyggddffpptt

comp_d2_s27 ✓ CORRECT exact

Expected: qzizdblhhbnimttl | Predicted: qzizdblhhbnimttl

comp_d2_s28 ✓ CORRECT exact

Expected: jhq | Predicted: jhq

comp_d2_s29 ✓ CORRECT exact

Expected: nmoulvq | Predicted: nmoulvq

comp_d2_s3 ✓ CORRECT exact

Expected: oqqmsqdgaw | Predicted: oqqmsqdgaw

comp_d2_s30 ✓ CORRECT exact

Expected: vzyl | Predicted: vzyl

comp_d2_s32 ✓ CORRECT exact

Expected: ipdjzcrh | Predicted: ipdjzcrh

comp_d2_s33 ✓ CORRECT exact

Expected: vvrrvvttrrnnqqllssccooqq | Predicted: vvrrvvttrrnnqqllssccooqq

comp_d2_s34 ✓ CORRECT exact

Expected: ymfdcef | Predicted: ymfdcef

comp_d2_s35 ✓ CORRECT exact

Expected: vwaippeeeessqq | Predicted: vwaippeeeessqq

comp_d2_s39 ✓ CORRECT exact

Expected: celvcxvx | Predicted: celvcxvx

comp_d2_s4 ✓ CORRECT exact

Expected: zdgnkjy | Predicted: zdgnkjy

comp_d2_s40 ✓ CORRECT exact

Expected: kkjjkk | Predicted: kkjjkk

comp_d2_s41 ✓ CORRECT exact

Expected: icdhvsdslu | Predicted: icdhvsdslu

comp_d2_s42 ✓ CORRECT exact

Expected: gvchhbkulcx | Predicted: gvchhbkulcx

comp_d2_s43 ✓ CORRECT exact

Expected: shajoe | Predicted: shajoe

comp_d2_s44 ✓ CORRECT exact

Expected: bnnkkmmqq | Predicted: bnnkkmmqq

comp_d2_s45 ✓ CORRECT exact

Expected: bgj | Predicted: bgj

comp_d2_s46 ✓ CORRECT exact

Expected: tdjivrzme | Predicted: tdjivrzme

comp_d2_s47 ✓ CORRECT exact

Expected: ulhbnzogtejxi | Predicted: ulhbnzogtejxi

comp_d2_s49 ✓ CORRECT exact

Expected: cpsv | Predicted: cpsv

comp_d2_s5 ✓ CORRECT exact

Expected: xrvqy | Predicted: xrvqy

comp_d2_s50 ✓ CORRECT exact

Expected: cehils | Predicted: cehils

comp_d2_s51 ✓ CORRECT exact

Expected: brkntqu | Predicted: brkntqu

comp_d2_s52 ✓ CORRECT exact

Expected: abujjopryjwr | Predicted: abujjopryjwr

comp_d2_s53 ✓ CORRECT exact

Expected: nkgctmkog | Predicted: nkgctmkog

comp_d2_s54 ✓ CORRECT exact

Expected: dhmwxz | Predicted: dhmwxz

comp_d2_s55 ✓ CORRECT exact

Expected: rs4hh2sslcrv | Predicted: rs4hh2sslcrv

comp_d2_s57 ✓ CORRECT exact

Expected: ntvbzqlriwbks | Predicted: ntvbzqlriwbks

comp_d2_s58 ✓ CORRECT exact

Expected: deisynhcfyhq | Predicted: deisynhcfyhq

comp_d2_s59 ✓ CORRECT exact

Expected: eyervvoovvaabbqqooqq | Predicted: eyervvoovvaabbqqooqq

comp_d2_s6 ✓ CORRECT exact

Expected: fgjjlqxz | Predicted: fgjjlqxz

comp_d2_s60 ✓ CORRECT exact

Expected: qhg5klr | Predicted: qhg5klr

comp_d2_s62 ✓ CORRECT exact

Expected: ggchhcgg | Predicted: ggchhcgg

comp_d2_s63 ✓ CORRECT exact

Expected: ggwweeiippiittmmnn | Predicted: ggwweeiippiittmmnn

comp_d2_s64 ✓ CORRECT exact

Expected: 1bfjmnrwyz | Predicted: 1bfjmnrwyz

comp_d2_s65 ✓ CORRECT exact

Expected: lww | Predicted: lww

comp_d2_s67 ✓ CORRECT exact

Expected: klov | Predicted: klov

comp_d2_s68 ✓ CORRECT exact

Expected: nnnnbbddff | Predicted: nnnnbbddff

comp_d2_s69 ✓ CORRECT exact

Expected: y3kqvpndxd | Predicted: y3kqvpndxd

comp_d2_s7 ✓ CORRECT exact

Expected: eeddwwllrrqqpp | Predicted: eeddwwllrrqqpp

comp_d2_s71 ✓ CORRECT exact

Expected: grzpfbmnsu | Predicted: grzpfbmnsu

comp_d2_s74 ✓ CORRECT exact

Expected: effghiosz | Predicted: effghiosz

comp_d2_s77 ✓ CORRECT exact

Expected: pgrnbkscua | Predicted: pgrnbkscua

comp_d2_s78 ✓ CORRECT exact

Expected: thbtqwm | Predicted: thbtqwm

comp_d2_s79 ✓ CORRECT exact

Expected: xxeewwwweexx | Predicted: xxeewwwweexx

comp_d2_s8 ✓ CORRECT exact

Expected: gdobtffw | Predicted: gdobtffw

comp_d2_s80 ✓ CORRECT exact

Expected: waoczhzuet | Predicted: waoczhzuet

comp_d2_s82 ✓ CORRECT exact

Expected: vwxcen | Predicted: vwxcen

comp_d2_s83 ✓ CORRECT exact

Expected: ggyhsaqsbmy | Predicted: ggyhsaqsbmy

comp_d2_s84 ✓ CORRECT exact

Expected: eryrqyjfym | Predicted: eryrqyjfym

comp_d2_s85 ✓ CORRECT exact

Expected: k4mnrfgt | Predicted: k4mnrfgt

comp_d2_s86 ✓ CORRECT exact

Expected: njxbboonnoo | Predicted: njxbboonnoo

comp_d2_s87 ✓ CORRECT exact

Expected: dkooriagafwlrpcnv | Predicted: dkooriagafwlrpcnv

comp_d2_s88 ✓ CORRECT exact

Expected: ibzzmhe | Predicted: ibzzmhe

comp_d2_s89 ✓ CORRECT exact

Expected: jjrrrrooqqjjeennwwxx | Predicted: jjrrrrooqqjjeennwwxx

comp_d2_s9 ✓ CORRECT exact

Expected: djuawddcw | Predicted: djuawddcw

comp_d2_s90 ✓ CORRECT exact

Expected: ronvgjhysczvtz | Predicted: ronvgjhysczvtz

comp_d2_s91 ✓ CORRECT exact

Expected: ggbbpp | Predicted: ggbbpp

comp_d2_s92 ✓ CORRECT exact

Expected: gbl | Predicted: gbl

comp_d2_s93 ✓ CORRECT exact

Expected: dzgzd | Predicted: dzgzd

comp_d2_s94 ✓ CORRECT exact

Expected: khoohknzzn | Predicted: khoohknzzn

comp_d2_s97 ✓ CORRECT exact

Expected: rwetseg | Predicted: rwetseg

comp_d2_s0 ✗ INCORRECT mismatch

Expected: adfijjkuwx | Predicted: adjfikuuwx

comp_d2_s12 ✗ INCORRECT mismatch

Expected: 3dhq43z | Predicted: idhq4iz

comp_d2_s15 ✗ INCORRECT mismatch

Expected: tjooppooiidd | Predicted: tjoopoooiidd

comp_d2_s2 ✗ INCORRECT mismatch

Expected: zujuanlrrzdz | Predicted: zzujuanlrrzdz

comp_d2_s20 ✗ INCORRECT mismatch

Expected: hoqrubce | Predicted: hooqrbce

comp_d2_s31 ✗ INCORRECT mismatch

Expected: eglljjxxkkuurr | Predicted: eglljjxxkkuu rr

comp_d2_s36 ✗ INCORRECT mismatch

Expected: zpt5ggkr | Predicted: zptu5gkr

comp_d2_s37 ✗ INCORRECT mismatch

Expected: r4l55nk22kn55l4r | Predicted: r4l55nk2eknuul5r

comp_d2_s38 ✗ INCORRECT mismatch

Expected: wbmjyyjmbw | Predicted: wbmjyymjbw

comp_d2_s48 ✗ INCORRECT mismatch

Expected: ijkllrye | Predicted: ijkllyre

comp_d2_s56 ✗ INCORRECT mismatch

Expected: omgmjdf | Predicted: jomgmdf

comp_d2_s61 ✗ INCORRECT mismatch

Expected: w3z3ktncmfp | Predicted: wiziktncmfp

comp_d2_s66 ✗ INCORRECT mismatch

Expected: ooffkk | Predicted: ooofffkk

comp_d2_s70 ✗ INCORRECT no_answer

Expected: shvjrzpusr | Predicted:

comp_d2_s72 ✗ INCORRECT mismatch

Expected: iiddrrkkuuxxwwttbbddggoossee | Predicted: iiddrrkkuuxxwwtbbddggossee

comp_d2_s73 ✗ INCORRECT mismatch

Expected: zzmmjjyyllnnoorrrroonnllyyjjmmzz | Predicted: zzmmjjyyllnnoorrrnnoollyyjjmmzz

comp_d2_s75 ✗ INCORRECT mismatch

Expected: oooxxkkiiuuhho | Predicted: ooxxkkiiuuhhh

comp_d2_s76 ✗ INCORRECT mismatch

Expected: fhhortvyhs | Predicted: ffhhoortvyhs

comp_d2_s81 ✗ INCORRECT no_answer

Expected: xdnhvitqsa | Predicted:

comp_d2_s95 ✗ INCORRECT mismatch

Expected: sssskkkkhhhheeeennnnggggffffwwww | Predicted: sssskkkkhhhhheeeennnnggggffffwwww

comp_d2_s96 ✗ INCORRECT mismatch

Expected: kg3cloz | Predicted: kgicloz

comp_d2_s98 ✗ INCORRECT mismatch

Expected: fxhxfetxtefxhxf | Predicted: fxhxfetxtextfxhxf

comp_d2_s99 ✗ INCORRECT mismatch

Expected: uujjnnjjggnnmmccccmmnnggjjnnjjuu | Predicted: uujjnnjjggnnmmccmmccnnggnnjjnnuu

📊 Level 3 - 3 Functions Nested

Accuracy: 58.0% (58/100 Correct)
Answer Rate: 94.0% (94/100 Answered)
Total Samples: 100 Test Cases

📈 Match Type Distribution

Match Type  Count  Percentage
exact       58     58.0%
mismatch    36     36.0%
no_answer   6      6.0%

🔍 Failure Analysis

Failure Type         Count  Share of Failures
No Answer Generated  6      14.3%
Wrong Answer         36     85.7%

Common Failure Patterns:

  • Wrong Answers (36): The model completed execution but computed an incorrect result
  • Missing Answer Tags (6): The model completed but did not produce an [ANSWER] tag
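A sample counts as unanswered when no answer tag can be found in the response. The sketch below shows one way such extraction might work. It rests on an assumption: the report only names an [ANSWER] tag, so the closing `[/ANSWER]` delimiter and the helper name `extract_answer` are hypothetical, not the harness's confirmed format.

```python
import re

# Assumed format: the final answer is wrapped as [ANSWER]...[/ANSWER].
# Only the opening tag is named in the report; the closing tag is a guess.
ANSWER_RE = re.compile(r"\[ANSWER\](.*?)\[/ANSWER\]", re.DOTALL)


def extract_answer(response):
    """Return the last tagged answer, stripped, or None if no tag pair exists."""
    matches = ANSWER_RE.findall(response)
    if not matches:
        return None
    return matches[-1].strip()


tagged = "Applying the functions in order... [ANSWER]oouujjxxnnqqccttrr[/ANSWER]"
untagged = "Applying the functions in order gives oouujjxxnnqqccttrr"

print(extract_answer(tagged))    # oouujjxxnnqqccttrr
print(extract_answer(untagged))  # None
```

Under this sketch, the untagged response would be recorded as no_answer even though the right string appears in the text.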

📋 Sample Details

comp_d3_s0 ✓ CORRECT exact

Expected: fdgypkkpygdf4kk4 | Predicted: fdgypkkpygdf4kk4

comp_d3_s10 ✓ CORRECT exact

Expected: ggggoooosssszzzz | Predicted: ggggoooosssszzzz

comp_d3_s12 ✓ CORRECT exact

Expected: hpsm2 | Predicted: hpsm2

comp_d3_s14 ✓ CORRECT exact

Expected: lhsv5k | Predicted: lhsv5k

comp_d3_s17 ✓ CORRECT exact

Expected: ccffffttttcc | Predicted: ccffffttttcc

comp_d3_s2 ✓ CORRECT exact

Expected: yrtkzynv | Predicted: yrtkzynv

comp_d3_s20 ✓ CORRECT exact

Expected: relmunoi | Predicted: relmunoi

comp_d3_s21 ✓ CORRECT exact

Expected: llllllllaaaaffffhhhhssss | Predicted: llllllllaaaaffffhhhhssss

comp_d3_s22 ✓ CORRECT exact

Expected: vsdjxprig | Predicted: vsdjxprig

comp_d3_s23 ✓ CORRECT exact

Expected: qsqzomdilbpok | Predicted: qsqzomdilbpok

comp_d3_s24 ✓ CORRECT exact

Expected: znmrkv | Predicted: znmrkv

comp_d3_s25 ✓ CORRECT exact

Expected: hhn2lt45 | Predicted: hhn2lt45

comp_d3_s26 ✓ CORRECT exact

Expected: 4g4z | Predicted: 4g4z

comp_d3_s27 ✓ CORRECT exact

Expected: zcvthz | Predicted: zcvthz

comp_d3_s30 ✓ CORRECT exact

Expected: sdssvvttrrmmzw | Predicted: sdssvvttrrmmzw

comp_d3_s31 ✓ CORRECT exact

Expected: qqqqwwwwccccnnnn | Predicted: qqqqwwwwccccnnnn

comp_d3_s34 ✓ CORRECT exact

Expected: uldwlzlzxwhwwc | Predicted: uldwlzlzxwhwwc

comp_d3_s36 ✓ CORRECT exact

Expected: vzkdbjsc | Predicted: vzkdbjsc

comp_d3_s37 ✓ CORRECT exact

Expected: qrbop | Predicted: qrbop

comp_d3_s38 ✓ CORRECT exact

Expected: xddsjlqlfqd | Predicted: xddsjlqlfqd

comp_d3_s4 ✓ CORRECT exact

Expected: nnggyyllggsshhvvcczznnvvggkk | Predicted: nnggyyllggsshhvvcczznnvvggkk

comp_d3_s40 ✓ CORRECT exact

Expected: gguuuuuuuugg | Predicted: gguuuuuuuugg

comp_d3_s42 ✓ CORRECT exact

Expected: p1nqnkscnqhbb | Predicted: p1nqnkscnqhbb

comp_d3_s46 ✓ CORRECT exact

Expected: zzqqmmgg | Predicted: zzqqmmgg

comp_d3_s47 ✓ CORRECT exact

Expected: cosjg | Predicted: cosjg

comp_d3_s48 ✓ CORRECT exact

Expected: wabbaw | Predicted: wabbaw

comp_d3_s5 ✓ CORRECT exact

Expected: jmqccvvqqiizz | Predicted: jmqccvvqqiizz

comp_d3_s52 ✓ CORRECT exact

Expected: ngmbzswsc | Predicted: ngmbzswsc

comp_d3_s55 ✓ CORRECT exact

Expected: zff33ggdd22rrvvz | Predicted: zff33ggdd22rrvvz

comp_d3_s56 ✓ CORRECT exact

Expected: nrccudbuveygj | Predicted: nrccudbuveygj

comp_d3_s58 ✓ CORRECT exact

Expected: mjljxhldpbyl | Predicted: mjljxhldpbyl

comp_d3_s6 ✓ CORRECT exact

Expected: fwhyrhorewaf | Predicted: fwhyrhorewaf

comp_d3_s63 ✓ CORRECT exact

Expected: cfgmn | Predicted: cfgmn

comp_d3_s64 ✓ CORRECT exact

Expected: euwbi | Predicted: euwbi

comp_d3_s65 ✓ CORRECT exact

Expected: hrrddssyyyykkzzzzxpjrh | Predicted: hrrddssyyyykkzzzzxpjrh

comp_d3_s66 ✓ CORRECT exact

Expected: ydtgxzf | Predicted: ydtgxzf

comp_d3_s67 ✓ CORRECT exact

Expected: fsjblks | Predicted: fsjblks

comp_d3_s68 ✓ CORRECT exact

Expected: 22bbvvhh33yyjjv | Predicted: 22bbvvhh33yyjjv

comp_d3_s69 ✓ CORRECT exact

Expected: ppmqpebyl | Predicted: ppmqpebyl

comp_d3_s71 ✓ CORRECT exact

Expected: aehcvikmrfcvq | Predicted: aehcvikmrfcvq

comp_d3_s72 ✓ CORRECT exact

Expected: jjlgmfg | Predicted: jjlgmfg

comp_d3_s73 ✓ CORRECT exact

Expected: nwovk | Predicted: nwovk

comp_d3_s75 ✓ CORRECT exact

Expected: vgxxbcrffrcbxxgv | Predicted: vgxxbcrffrcbxxgv

comp_d3_s76 ✓ CORRECT exact

Expected: wwnnxxaassppuucc | Predicted: wwnnxxaassppuucc

comp_d3_s77 ✓ CORRECT exact

Expected: wgromekqpzccuw | Predicted: wgromekqpzccuw

comp_d3_s80 ✓ CORRECT exact

Expected: innddwwi | Predicted: innddwwi

comp_d3_s82 ✓ CORRECT exact

Expected: hk2443vg | Predicted: hk2443vg

comp_d3_s86 ✓ CORRECT exact

Expected: mhxkpzv | Predicted: mhxkpzv

comp_d3_s87 ✓ CORRECT exact

Expected: mosth | Predicted: mosth

comp_d3_s88 ✓ CORRECT exact

Expected: wkxjdbqpqmmmmqpqbd | Predicted: wkxjdbqpqmmmmqpqbd

comp_d3_s90 ✓ CORRECT exact

Expected: euudmqoqah | Predicted: euudmqoqah

comp_d3_s92 ✓ CORRECT exact

Expected: icnhbbyllybbhnciicnhbbyllybbhnci | Predicted: icnhbbyllybbhnciicnhbbyllybbhnci

comp_d3_s93 ✓ CORRECT exact

Expected: zf1nds51fex | Predicted: zf1nds51fex

comp_d3_s94 ✓ CORRECT exact

Expected: bbbbccccpppp | Predicted: bbbbccccpppp

comp_d3_s95 ✓ CORRECT exact

Expected: ldgfrakc | Predicted: ldgfrakc

comp_d3_s96 ✓ CORRECT exact

Expected: hknqx | Predicted: hknqx

comp_d3_s98 ✓ CORRECT exact

Expected: bc2h3m55m3h2cb | Predicted: bc2h3m55m3h2cb

comp_d3_s99 ✓ CORRECT exact

Expected: ixwrwcvekzh | Predicted: ixwrwcvekzh

comp_d3_s1 ✗ INCORRECT mismatch

Expected: vzqmeueggeuemqzvogo | Predicted: vzqmeueggemuezqvgogo

comp_d3_s11 ✗ INCORRECT mismatch

Expected: ggggqqppffcchhyyhhmmbbhhhhuu | Predicted: ggggqqppffccchhyyhhmmbbhhhhuu

comp_d3_s13 ✗ INCORRECT mismatch

Expected: iijjuuuuyyzzzzeeffhh | Predicted: ffhhiijjuuyyzzeeee

comp_d3_s15 ✗ INCORRECT mismatch

Expected: bbmmssuuzz | Predicted: bbmssssssuum

comp_d3_s16 ✗ INCORRECT no_answer

Expected: oouujjxxnnqqccttrr | Predicted:

comp_d3_s18 ✗ INCORRECT mismatch

Expected: n31ckzdy5pp5ydzkc13n | Predicted: 3112kzdy5pp5dyzdk2113

comp_d3_s19 ✗ INCORRECT mismatch

Expected: lshiunj | Predicted: lshunj

comp_d3_s28 ✗ INCORRECT mismatch

Expected: dflpqrtvww | Predicted: df3lpqrwvw

comp_d3_s29 ✗ INCORRECT mismatch

Expected: npezhoynnyohzepnnpezhoynnyohzepn | Predicted: npezhoynnyhozepnnpezhoynnyhozepn

comp_d3_s3 ✗ INCORRECT mismatch

Expected: qrrankasnhnsfllfsnhnsaknarrq | Predicted: qrrankasnhnsfllfhsnhnsaknarq

comp_d3_s32 ✗ INCORRECT mismatch

Expected: cpqswwwz | Predicted: ccppqswwwz

comp_d3_s33 ✗ INCORRECT no_answer

Expected: vmihodwuxuwdohimv | Predicted:

comp_d3_s35 ✗ INCORRECT mismatch

Expected: cflqrstbcc | Predicted: cfflqrsstbcc

comp_d3_s39 ✗ INCORRECT mismatch

Expected: hrwwrhcopp | Predicted: hrrwwrrhcopp

comp_d3_s41 ✗ INCORRECT mismatch

Expected: qqbbxxzzyyddddyyzz | Predicted: qqbbxxzzyyddddyzz

comp_d3_s43 ✗ INCORRECT mismatch

Expected: bbekkqqrrssssvvwwz | Predicted: bbsbkkqqrrsssssvvwz

comp_d3_s44 ✗ INCORRECT mismatch

Expected: dpbdsseyyessdbpd | Predicted: dpbdsseyeyesbdpd

comp_d3_s45 ✗ INCORRECT mismatch

Expected: djtw | Predicted: djtwt

comp_d3_s49 ✗ INCORRECT mismatch

Expected: gxxiivvooddrrbbyywwffg | Predicted: ggxxiivvooddrrbbyywwff

comp_d3_s50 ✗ INCORRECT mismatch

Expected: bmhvioafoofaoivhmb | Predicted: bmhvioafoofaoviomb

comp_d3_s51 ✗ INCORRECT mismatch

Expected: 5dy1t4 | Predicted: 5dyat4u

comp_d3_s53 ✗ INCORRECT mismatch

Expected: kxltnqkppkqn | Predicted: kxltnqkpplkqn

comp_d3_s54 ✗ INCORRECT mismatch

Expected: beefhipuupihfeebwyk | Predicted: beeefhipuupihfeeebwyk

comp_d3_s57 ✗ INCORRECT mismatch

Expected: xbzfqvccvqfzbxjqqqqjxbzfqvccvqfzbx | Predicted: xbzfqvcvqfzbxjqqqqjxbzfqvcvqfzbx

comp_d3_s59 ✗ INCORRECT no_answer

Expected: 5555bbddffhhnnqqxxzz | Predicted:

comp_d3_s60 ✗ INCORRECT no_answer

Expected: nnkk2233ffxx55xx | Predicted:

comp_d3_s61 ✗ INCORRECT no_answer

Expected: ttzzzzccrrggvvxx | Predicted:

comp_d3_s62 ✗ INCORRECT mismatch

Expected: jvepusxaaxsupevj | Predicted: jvepusxajxuspevj

comp_d3_s7 ✗ INCORRECT mismatch

Expected: eeggvvuurrccttiiroj | Predicted: eeggvvuurrccttiroj

comp_d3_s70 ✗ INCORRECT mismatch

Expected: ssssssgghhggbboovviizz | Predicted: ssssssghhggbbboovviiizz

comp_d3_s74 ✗ INCORRECT mismatch

Expected: d2nstxzz | Predicted: dd2znstx

comp_d3_s78 ✗ INCORRECT mismatch

Expected: rrrrbbbbmmmmggggjjjjiiiiuuuuiiii | Predicted: rrrrbbbbmmmmggggjjjjiiiuuuuiiii

comp_d3_s79 ✗ INCORRECT mismatch

Expected: hthkq2fvttttvf2qkh | Predicted: hthkq2fvtttvtf2qkh

comp_d3_s8 ✗ INCORRECT mismatch

Expected: itzziiqquullffoojjeeeejjoofflluuqqiizz | Predicted: itzziiqquullffoojjeeeoojjlluuffiiqqiizz

comp_d3_s81 ✗ INCORRECT mismatch

Expected: bjfjmkxrgmmgrxkmjfjb | Predicted: bjfjmkxrgmmgrxkjmfjb

comp_d3_s83 ✗ INCORRECT mismatch

Expected: defmnruwzzwurnmfeddefmnruwzzwurnmfed | Predicted: defmnruwzzwurmnfeddefmnruwzzwurmnfed

comp_d3_s84 ✗ INCORRECT mismatch

Expected: aaaaffffhhhhllllnnnnppppqqqqzzzz | Predicted: aaaaaaaaffffhhhhllllnnnnppppqqqqzzzz

comp_d3_s85 ✗ INCORRECT mismatch

Expected: dghhistwxx | Predicted: ddghhiiostwxx

comp_d3_s89 ✗ INCORRECT mismatch

Expected: jhbwzizxzzxzizwbhj | Predicted: jhbwzizxzzxizwbhj

comp_d3_s9 ✗ INCORRECT mismatch

Expected: jjjjhhhhddddeeeellllooookkkkaaaajjjjccccjjccyy | Predicted: jjjjhhhhddddeelllleeeeeookkkkaaaaajjjjccccjjccyy

comp_d3_s91 ✗ INCORRECT no_answer

Expected: zzzzzzzzjjjjjjjjaaaaaaaaffffffffjjjjjjjjaaaaaaaaiiiiiiiivvvvvvvveeeeeeee | Predicted:

comp_d3_s97 ✗ INCORRECT mismatch

Expected: fgnfhbmfy31rxxr13yfmbh | Predicted: fgnfhbmfy13rxxar1yfmbh

📊 Level 4 - 4 Functions Nested

Accuracy: 39.0% (39/100 Correct)
Answer Rate: 84.0% (84/100 Answered)
Total Samples: 100 Test Cases

📈 Match Type Distribution

Match Type  Count  Percentage
mismatch    45     45.0%
exact       39     39.0%
no_answer   16     16.0%

🔍 Failure Analysis

Failure Type         Count  Share of Failures
No Answer Generated  16     26.2%
Wrong Answer         45     73.8%

Common Failure Patterns:

  • Wrong Answers (45): The model completed execution but computed an incorrect result
  • Missing Answer Tags (16): The model completed but did not produce an [ANSWER] tag
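The tallies for a level can be cross-checked against each other: the three match types should sum to the sample count, answered samples are exact plus mismatch, and failures are mismatch plus no_answer. A minimal sketch using the Level 4 counts follows; the function name `check_level` is hypothetical.

```python
def check_level(samples, exact, mismatch, no_answer):
    """Verify that the match-type counts add up and return derived figures."""
    if exact + mismatch + no_answer != samples:
        raise ValueError("match-type counts do not sum to the sample count")
    answered = exact + mismatch      # any non-empty prediction
    failures = mismatch + no_answer  # everything that is not an exact match
    return {
        "answered": answered,
        "failures": failures,
        "answer_rate": round(100.0 * answered / samples, 1),
        "accuracy": round(100.0 * exact / samples, 1),
    }


level4 = check_level(samples=100, exact=39, mismatch=45, no_answer=16)
print(level4)
# {'answered': 84, 'failures': 61, 'answer_rate': 84.0, 'accuracy': 39.0}
```

The derived values agree with the Level 4 summary: 84/100 answered and 39/100 correct.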

📋 Sample Details

comp_d4_s0 ✓ CORRECT exact

Expected: uremmxlbcxupw | Predicted: uremmxlbcxupw

comp_d4_s1 ✓ CORRECT exact

Expected: nsczri | Predicted: nsczri

comp_d4_s12 ✓ CORRECT exact

Expected: bdtzzzztdbbdtzzzztdb | Predicted: bdtzzzztdbbdtzzzztdb

comp_d4_s19 ✓ CORRECT exact

Expected: krmcdkuoyduh | Predicted: krmcdkuoyduh

comp_d4_s2 ✓ CORRECT exact

Expected: wwbbmmggzzbbffqqkkaayy | Predicted: wwbbmmggzzbbffqqkkaayy

comp_d4_s20 ✓ CORRECT exact

Expected: okob | Predicted: okob

comp_d4_s26 ✓ CORRECT exact

Expected: bqczqhtpowwopthqzcqb | Predicted: bqczqhtpowwopthqzcqb

comp_d4_s27 ✓ CORRECT exact

Expected: nnnnnnnnjjjjxxxx | Predicted: nnnnnnnnjjjjxxxx

comp_d4_s28 ✓ CORRECT exact

Expected: dlfnh | Predicted: dlfnh

comp_d4_s30 ✓ CORRECT exact

Expected: nkzqwbg | Predicted: nkzqwbg

comp_d4_s36 ✓ CORRECT exact

Expected: ycnu5vmmv5prbk | Predicted: ycnu5vmmv5prbk

comp_d4_s38 ✓ CORRECT exact

Expected: fzlhvxjogkcbhyuy | Predicted: fzlhvxjogkcbhyuy

comp_d4_s4 ✓ CORRECT exact

Expected: 3v3 | Predicted: 3v3

comp_d4_s42 ✓ CORRECT exact

Expected: 4h2tnktdxw | Predicted: 4h2tnktdxw

comp_d4_s43 ✓ CORRECT exact

Expected: nkr | Predicted: nkr

comp_d4_s47 ✓ CORRECT exact

Expected: krbvm4 | Predicted: krbvm4

comp_d4_s48 ✓ CORRECT exact

Expected: oygrjk | Predicted: oygrjk

comp_d4_s5 ✓ CORRECT exact

Expected: jcf3vzdsscp2ykl54 | Predicted: jcf3vzdsscp2ykl54

comp_d4_s50 ✓ CORRECT exact

Expected: ttnnllppkkxxeeaassrryyffiihh | Predicted: ttnnllppkkxxeeaassrryyffiihh

comp_d4_s57 ✓ CORRECT exact

Expected: bcdeilms | Predicted: bcdeilms

comp_d4_s58 ✓ CORRECT exact

Expected: sgdmzh | Predicted: sgdmzh

comp_d4_s61 ✓ CORRECT exact

Expected: mhjwfc | Predicted: mhjwfc

comp_d4_s62 ✓ CORRECT exact

Expected: eeuuggmmrriizzttbbppkkggrr | Predicted: eeuuggmmrriizzttbbppkkggrr

comp_d4_s63 ✓ CORRECT exact

Expected: irjdutulm | Predicted: irjdutulm

comp_d4_s66 ✓ CORRECT exact

Expected: cornsgjpbyffybpjgsnroc | Predicted: cornsgjpbyffybpjgsnroc

comp_d4_s67 ✓ CORRECT exact

Expected: bvg | Predicted: bvg

comp_d4_s70 ✓ CORRECT exact

Expected: psz | Predicted: psz

comp_d4_s73 ✓ CORRECT exact

Expected: wyyqqqqbbttw | Predicted: wyyqqqqbbttw

comp_d4_s74 ✓ CORRECT exact

Expected: sdrzis | Predicted: sdrzis

comp_d4_s75 ✓ CORRECT exact

Expected: dimggkkooookkgg | Predicted: dimggkkooookkgg

comp_d4_s8 ✓ CORRECT exact

Expected: xdszeoezsdxrxdszeoezsdxrr | Predicted: xdszeoezsdxrxdszeoezsdxrr

comp_d4_s80 ✓ CORRECT exact

Expected: bxhhxb | Predicted: bxhhxb

comp_d4_s87 ✓ CORRECT exact

Expected: vvssssvvpppp | Predicted: vvssssvvpppp

comp_d4_s88 ✓ CORRECT exact

Expected: lvbz | Predicted: lvbz

comp_d4_s89 ✓ CORRECT exact

Expected: nqbhkhvlk | Predicted: nqbhkhvlk

comp_d4_s9 ✓ CORRECT exact

Expected: tkkaaccjjbbbbwwork | Predicted: tkkaaccjjbbbbwwork

comp_d4_s91 ✓ CORRECT exact

Expected: xopeobjnju2dtb3 | Predicted: xopeobjnju2dtb3

comp_d4_s97 ✓ CORRECT exact

Expected: dm5sz | Predicted: dm5sz

comp_d4_s99 ✓ CORRECT exact

Expected: iariffvvvv | Predicted: iariffvvvv

comp_d4_s10 ✗ INCORRECT mismatch

Expected: acghotttuuww | Predicted: acghoottttuuw

comp_d4_s11 ✗ INCORRECT no_answer

Expected: zzzzzzzzbbbbbbbbttttttttffffffffwwwwwwwwwwwwwwwwllllllllffffffff | Predicted:

comp_d4_s13 ✗ INCORRECT mismatch

Expected: ddddddddeeeeggggiiiippppssss | Predicted: dddddddddeeeeggggiiiiiiippppppppssssss

comp_d4_s14 ✗ INCORRECT mismatch

Expected: fkncynf | Predicted: fkncyfn

comp_d4_s15 ✗ INCORRECT mismatch

Expected: yzzalpqrvx | Predicted: vyyzalppqrx

comp_d4_s16 ✗ INCORRECT mismatch

Expected: hhxxmmmmhhhhqqqqiiiicccccccczzzzuuuunnnnqqqqoo | Predicted: hhxxmmhhhhqqqqiicccczzzzuuuunnqqoo

comp_d4_s17 ✗ INCORRECT mismatch

Expected: nnssqqyyttddlliiccbbuuuubbcciillddttyyqqssnnnnssqqyyttddlliiccbbuuuubbcciillddttyyqqssnn | Predicted: nnssqqyyttddlliiccbbuunnssqqyyttddlliiccbbuunnssqqyyttddlliiccbbuunnssqqyyttddlliiccbbuu

comp_d4_s18 ✗ INCORRECT no_answer

Expected: 1111llllnnnnppppqqqqqqqqrrrrsssstttt | Predicted:

comp_d4_s21 ✗ INCORRECT mismatch

Expected: xwdeeffffuuvv | Predicted: xwdeefffvfvv

comp_d4_s22 ✗ INCORRECT mismatch

Expected: cqhhiijlloopppqqry | Predicted: cqhiiijlloooopppqrry

comp_d4_s23 ✗ INCORRECT mismatch

Expected: iimmlluummcciiuuhhjjrr | Predicted: rriiimmeluummcciiuuhhjjj

comp_d4_s24 ✗ INCORRECT mismatch

Expected: cdjknprwxxwrpnkjdc | Predicted: cdjknprwdxcdjknprwdx

comp_d4_s25 ✗ INCORRECT mismatch

Expected: fmbeywdty | Predicted: fmboywdty

comp_d4_s29 ✗ INCORRECT no_answer

Expected: qkgdppdgkq | Predicted:

comp_d4_s3 ✗ INCORRECT no_answer

Expected: lwldt1k14v5zqxxqz5v41k1tdlwl | Predicted:

comp_d4_s31 ✗ INCORRECT mismatch

Expected: 2255eggrrvvvwz | Predicted: 222555ggrrvvvwez

comp_d4_s32 ✗ INCORRECT mismatch

Expected: cfvvvvppppeeeexxxxppppwwwwnnnnttttccccccccttttnnnnwwwwppppxxxxeeeeppppvvvv | Predicted: cfvvvvppppeeexxxxppppwwwwnnnnttttcccccccccttttnnnnwwwppppxxxeeeeeppppvvvv

comp_d4_s33 ✗ INCORRECT mismatch

Expected: uujjppttssccff | Predicted: uujjpppttssssttccff

comp_d4_s34 ✗ INCORRECT mismatch

Expected: 111155ffggjjnnnntthw | Predicted: 111555fgfggjnnnnntthw

comp_d4_s35 ✗ INCORRECT mismatch

Expected: zzkkuummjjooggqqggwwoopp | Predicted: zzkkuummjjooqqggwwoopppp

comp_d4_s37 ✗ INCORRECT mismatch

Expected: cjprz | Predicted: c c j j p p r r r z z

comp_d4_s39 ✗ INCORRECT no_answer

Expected: bqzrf3yn1d | Predicted:

comp_d4_s40 ✗ INCORRECT mismatch

Expected: bbggjjmmrrvvxxyy | Predicted: bbgggjjmmrrssvvxyxy

comp_d4_s41 ✗ INCORRECT no_answer

Expected: wffgq22qgffwwffgq22qgffwwffgq22qgffwwffgq22qgffw | Predicted:

comp_d4_s44 ✗ INCORRECT no_answer

Expected: 31r22gghhyy3322llpip | Predicted:

comp_d4_s45 ✗ INCORRECT mismatch

Expected: rrrrvvvvssssoooozzzz | Predicted: rrrrvvvvssssoozzzz

comp_d4_s46 ✗ INCORRECT mismatch

Expected: dyduemmeudyd | Predicted: dydyuemmeuydyd

comp_d4_s49 ✗ INCORRECT mismatch

Expected: kmpqwynulc | Predicted: kmpqwywnulc

comp_d4_s51 ✗ INCORRECT mismatch

Expected: oakccqqccyycyycyyccqqcc | Predicted: oakccqqccyycyycyqqcc

comp_d4_s52 ✗ INCORRECT mismatch

Expected: aaaaaaaaddddddddeeeeeeeejjjjjjjjllllllllnnnnnnnnxxxxxxxx | Predicted: aaaa ddddddeeeeee jjjjjjjllllllllnnnnnnnnxxxxxxxx

comp_d4_s53 ✗ INCORRECT no_answer

Expected: dln4pqtwy | Predicted:

comp_d4_s54 ✗ INCORRECT no_answer

Expected: k21ntdhnrnhdtn12k33 | Predicted:

comp_d4_s55 ✗ INCORRECT mismatch

Expected: 2gl4ps55w | Predicted: 2gl4pss5w

comp_d4_s56 ✗ INCORRECT mismatch

Expected: ggggkkkkttttxxxxxxxxttttkkkkgggg | Predicted: ggggkkkkttttxxxxttttkkkkgggg

comp_d4_s59 ✗ INCORRECT mismatch

Expected: 2244ccddggkkssttxx11 | Predicted: 224422ccddeeeeggkksssttxx11

comp_d4_s6 ✗ INCORRECT mismatch

Expected: lzzaattoollvpu | Predicted: lzzaatttoolllvpul

comp_d4_s60 ✗ INCORRECT mismatch

Expected: ccffggllttttuu | Predicted: ccfffggllttttttu

comp_d4_s64 ✗ INCORRECT no_answer

Expected: 44mm444433mmww | Predicted:

comp_d4_s65 ✗ INCORRECT no_answer

Expected: tehkicrbyeybrcikhet | Predicted:

comp_d4_s68 ✗ INCORRECT mismatch

Expected: royocccellnoooorrsssttttuu | Predicted: royocccelnnoorssttttttu

comp_d4_s69 ✗ INCORRECT no_answer

Expected: qvigojoljizk | Predicted:

comp_d4_s7 ✗ INCORRECT mismatch

Expected: kksssstthh | Predicted: kkssstt sshh

comp_d4_s71 ✗ INCORRECT mismatch

Expected: celnoxzfi | Predicted: cenolxzfi

comp_d4_s72 ✗ INCORRECT mismatch

Expected: cehkooprsttvxxxxvttsrpookhec | Predicted: cehkoooprsttxvvtsttxprooohkce

comp_d4_s76 ✗ INCORRECT mismatch

Expected: cckkjj22ffkkoooonnggggll | Predicted: ccjj22ffkkoooonnggggll

comp_d4_s77 ✗ INCORRECT mismatch

Expected: uhmiywvafngmsddsmgnf | Predicted: uhmiywfafngmsddgmsnf

comp_d4_s78 ✗ INCORRECT mismatch

Expected: abhkruw | Predicted: abhkrhruw

comp_d4_s79 ✗ INCORRECT no_answer

Expected: xxxxxxxxmmmmmmmmppppppppccccccccccccccccppppppppmmmmmmmmxxxxxxxx | Predicted:

comp_d4_s81 ✗ INCORRECT no_answer

Expected: vvqqwwxxvv1144rrjjhh | Predicted:

comp_d4_s82 ✗ INCORRECT mismatch

Expected: ddiikkllllnnoopprrss | Predicted: ddikiklllnooopprrss

comp_d4_s83 ✗ INCORRECT mismatch

Expected: ioydirfdrrdfridyoixlxu | Predicted: ioydirfdrrdfriydoixlxu

comp_d4_s84 ✗ INCORRECT mismatch

Expected: eeddaazzqqppddlyx | Predicted: eedddaazzqqppddlyx

comp_d4_s85 ✗ INCORRECT mismatch

Expected: mekmefkzyudvvduyzkfemkem | Predicted: mekmefkzyudvmekmefkzyudv

comp_d4_s86 ✗ INCORRECT no_answer

Expected: eeyyllqqqqllyyeeeeyyllqqqqllyyeeeeyyllqqqqllyyeeeeyyllqqqqllyyee | Predicted:

comp_d4_s90 ✗ INCORRECT mismatch

Expected: iiiiqqqqsssseeeewwwwccccep | Predicted: iiiiqqqqsssssseeeewwwwccccep

comp_d4_s92 ✗ INCORRECT mismatch

Expected: 11fmmmpy | Predicted: 11fmmppy

comp_d4_s93 ✗ INCORRECT no_answer

Expected: pp11wwhh | Predicted:

comp_d4_s94 ✗ INCORRECT mismatch

Expected: wrvhxbjbxhvrw | Predicted: wrvhxbjbhxvrw

comp_d4_s95 ✗ INCORRECT mismatch

Expected: soozyycchhjjjjbblloorrpp | Predicted: soozyyccjjjjbbllloorrpp

comp_d4_s96 ✗ INCORRECT mismatch

Expected: 3cfkmprwyz | Predicted: cfikmprwyz

comp_d4_s98 ✗ INCORRECT mismatch

Expected: ggqqrr | Predicted: rggqqr

📊 Level 5 - 5 Functions Nested

Accuracy
21.0%
21/100 Correct
Answer Rate
68.0%
68/100 Answered
Total Samples
100
Test Cases

📈 Match Type Distribution

Match Type Count Percentage
mismatch 47 47.0%
no_answer 32 32.0%
exact 21 21.0%
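
The three match types above can be read as a simple labeling rule. The sketch below is an assumption about how the labels are assigned, not the evaluator's actual code: the `classify` helper is hypothetical, and it assumes an empty prediction means no answer was extracted and that comparison is strict string equality (consistent with comp_d5_s10, where a stray space yields a mismatch).

```python
from collections import Counter


def classify(expected: str, predicted: str) -> str:
    """Assign a match type to one sample (assumed rule, not the real evaluator)."""
    if predicted == "":
        # No prediction was extracted from the model's response.
        return "no_answer"
    # Strict equality: whitespace or a single wrong character counts as a mismatch.
    return "exact" if predicted == expected else "mismatch"


# (expected, predicted) pairs copied from the Level 5 sample details.
samples = [
    ("xdmij", "xdmij"),               # comp_d5_s52
    ("acjmpsz", "amcjpsz"),           # comp_d5_s13
    ("achklpqsxza", "ac hklpqszxa"),  # comp_d5_s10
    ("mvybn4r4qx", ""),               # comp_d5_s0
]

counts = Counter(classify(expected, predicted) for expected, predicted in samples)
print(counts)
```

Running the same rule over all 100 Level 5 samples would be expected to reproduce the counts in the table, provided the assumed rule matches the evaluator.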

🔍 Failure Analysis

Failure Type Count Share of Failures
No Answer Generated 32 40.5%
Wrong Answer 47 59.5%

Common Failure Patterns:

  • Wrong Answers (47): Model completed execution and produced an answer, but the result was incorrect
  • Missing Answer Tags (32): Model completed but did not produce an [ANSWER] tag (counted as No Answer Generated in the table above)
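
The percentages in this section use two different denominators: accuracy and answer rate are taken over all 100 samples, while the failure shares are taken over the 79 failed samples only. The sketch below shows that arithmetic. The `level_summary` function and its key names are illustrative, not part of the evaluation tooling.

```python
def level_summary(exact: int, mismatch: int, no_answer: int) -> dict:
    """Derive the headline percentages for one level from its match-type counts."""
    total = exact + mismatch + no_answer
    failures = mismatch + no_answer

    def pct(part: int, whole: int) -> float:
        # Guard against an empty denominator (e.g. a level with no failures).
        return round(100 * part / whole, 1) if whole else 0.0

    return {
        "accuracy": pct(exact, total),                  # share of all samples
        "answer_rate": pct(total - no_answer, total),   # share of all samples
        "no_answer_share": pct(no_answer, failures),    # share of failures only
        "wrong_answer_share": pct(mismatch, failures),  # share of failures only
    }


# Level 5 counts from the Match Type Distribution table.
summary = level_summary(exact=21, mismatch=47, no_answer=32)
print(summary)
```

With the Level 5 counts this gives 21.0% accuracy, a 68.0% answer rate, and failure shares of 40.5% and 59.5%, matching the figures reported above.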

📋 Sample Details

comp_d5_s2 ✓ CORRECT exact

Expected: wdf1k1fdwiuc | Predicted: wdf1k1fdwiuc

comp_d5_s20 ✓ CORRECT exact

Expected: wigcrgrunrzilr | Predicted: wigcrgrunrzilr

comp_d5_s22 ✓ CORRECT exact

Expected: bbhhkkzz | Predicted: bbhhkkzz

comp_d5_s24 ✓ CORRECT exact

Expected: mmnnyysx | Predicted: mmnnyysx

comp_d5_s27 ✓ CORRECT exact

Expected: zzvvzzaa55bbddllrr | Predicted: zzvvzzaa55bbddllrr

comp_d5_s28 ✓ CORRECT exact

Expected: bfiktvlebvcp | Predicted: bfiktvlebvcp

comp_d5_s33 ✓ CORRECT exact

Expected: ktxuictwjmdsp | Predicted: ktxuictwjmdsp

comp_d5_s45 ✓ CORRECT exact

Expected: 1qf5tfcwrcbq | Predicted: 1qf5tfcwrcbq

comp_d5_s5 ✓ CORRECT exact

Expected: vsqikkssxxssrrnnzzttmm | Predicted: vsqikkssxxssrrnnzzttmm

comp_d5_s52 ✓ CORRECT exact

Expected: xdmij | Predicted: xdmij

comp_d5_s61 ✓ CORRECT exact

Expected: 44wwhhssmm | Predicted: 44wwhhssmm

comp_d5_s63 ✓ CORRECT exact

Expected: pxbhijrgqjlz | Predicted: pxbhijrgqjlz

comp_d5_s64 ✓ CORRECT exact

Expected: spzjk4xg | Predicted: spzjk4xg

comp_d5_s7 ✓ CORRECT exact

Expected: jivpgcgloruyl | Predicted: jivpgcgloruyl

comp_d5_s71 ✓ CORRECT exact

Expected: jjmmyy | Predicted: jjmmyy

comp_d5_s78 ✓ CORRECT exact

Expected: aaddiisslgpq | Predicted: aaddiisslgpq

comp_d5_s79 ✓ CORRECT exact

Expected: yzptymjanj | Predicted: yzptymjanj

comp_d5_s83 ✓ CORRECT exact

Expected: cw2d | Predicted: cw2d

comp_d5_s88 ✓ CORRECT exact

Expected: demwwzzppmmqqzzjjrr | Predicted: demwwzzppmmqqzzjjrr

comp_d5_s9 ✓ CORRECT exact

Expected: psxyzpfri | Predicted: psxyzpfri

comp_d5_s98 ✓ CORRECT exact

Expected: dozzkaxsswwssiiiikkx | Predicted: dozzkaxsswwssiiiikkx

comp_d5_s0 ✗ INCORRECT no_answer

Expected: mvybn4r4qx | Predicted:

comp_d5_s1 ✗ INCORRECT mismatch

Expected: c22ffjjnn5511cc1155nnjjff22c | Predicted: 22jjnn55nu112211un55nnjj22

comp_d5_s10 ✗ INCORRECT mismatch

Expected: achklpqsxza | Predicted: ac hklpqszxa

comp_d5_s11 ✗ INCORRECT no_answer

Expected: vv22zzrrddbb11sseejjeess11bbddrrzz22vv | Predicted:

comp_d5_s12 ✗ INCORRECT no_answer

Expected: qqllllppwwgg55wwxxsskkffll | Predicted:

comp_d5_s13 ✗ INCORRECT mismatch

Expected: acjmpsz | Predicted: amcjpsz

comp_d5_s14 ✗ INCORRECT mismatch

Expected: 4cmprtx | Predicted: 4cmtx p

comp_d5_s15 ✗ INCORRECT no_answer

Expected: zh2sph543j | Predicted:

comp_d5_s16 ✗ INCORRECT no_answer

Expected: 1122ccddhhllppqvvw | Predicted:

comp_d5_s17 ✗ INCORRECT no_answer

Expected: 33cckkkkppppttttvvvvvvvv | Predicted:

comp_d5_s18 ✗ INCORRECT no_answer

Expected: bbmmmmmmmmmmmmbbtt33qqttttqq33ttbbmmmmmmmmmmmmbb | Predicted:

comp_d5_s19 ✗ INCORRECT no_answer

Expected: dw1ylrzrly1wd | Predicted:

comp_d5_s21 ✗ INCORRECT mismatch

Expected: nrffhhffaaccyyrrvv | Predicted: nrffhhffaaccyymmrrvv

comp_d5_s23 ✗ INCORRECT mismatch

Expected: bcdffkmstxz | Predicted: bcdfkkmstxz

comp_d5_s25 ✗ INCORRECT no_answer

Expected: jjjjrrrrddddnnnnzzzziiiihhhhbbbbzzzzqqqqjjjjwwwwjjjjqqqqzzzzbbbbhhhhiiiizzzznnnndddd | Predicted:

comp_d5_s26 ✗ INCORRECT mismatch

Expected: mxleazgesxqqxsegzaelxm | Predicted: mxleazgesxqqxsgezalexm

comp_d5_s29 ✗ INCORRECT mismatch

Expected: jewcccceeeeggjjppqqw | Predicted: jewcccceeeegjjpppwq

comp_d5_s3 ✗ INCORRECT mismatch

Expected: ehoruyyuroheccccehoruyyuroheccccccccehoruyyuroheccccehoruyyurohe | Predicted: hhooryuuyoorhhceecrhhrooyuuyoorhhceceecerhhrooyuuyoorhhceecrhhrooyuuyoorhhce

comp_d5_s30 ✗ INCORRECT no_answer

Expected: 2222ccccgggghhhhnnppppppppqqrrrrssssvvvv | Predicted:

comp_d5_s31 ✗ INCORRECT mismatch

Expected: zzqqwwqqzzqqwwqqzz | Predicted: qqwwzz zzqqzzww wwzzqqzz zzqqww

comp_d5_s32 ✗ INCORRECT mismatch

Expected: bbeeffggjjllmmqqrrxxxxzzzzxxxxrrqqmmlljjggffeebb | Predicted: bbceefgggjllqqrrrxxyzzyxxrrrqqlljgggeefcbb

comp_d5_s34 ✗ INCORRECT mismatch

Expected: qqzzffddrrbbbbrrddffzzqqeejjll | Predicted: qqzzffddrrbbbbddrrffzzqqeejjll

comp_d5_s35 ✗ INCORRECT no_answer

Expected: tfpp44ffkknnpuq | Predicted:

comp_d5_s36 ✗ INCORRECT mismatch

Expected: ddooaazzxxppppxxzzaaooddddooaazzxxppppxxzzaaoodd | Predicted: ddooaazzxxppppxxaazzooddooddaazzxxppppxxaazzoodd

comp_d5_s37 ✗ INCORRECT no_answer

Expected: m5vlg41ty | Predicted:

comp_d5_s38 ✗ INCORRECT mismatch

Expected: uwk1hkln4pqqrsz | Predicted: uwk1hkk44pqqrsz

comp_d5_s39 ✗ INCORRECT mismatch

Expected: sooqqttttddiibjq | Predicted: sooqqtttddiibjq

comp_d5_s4 ✗ INCORRECT mismatch

Expected: oxabei | Predicted: ixoaobe

comp_d5_s40 ✗ INCORRECT mismatch

Expected: nrsbdhimnnmihdbsrnnrsbdhimnnmihdbsrn | Predicted: imnnrsbbdhhdbsbnnrmiimrnnbbsdhdbbbsnrnnmi

comp_d5_s41 ✗ INCORRECT mismatch

Expected: mw2cxhyhz44zhyhxc2 | Predicted: mw2cxhyh442hyhc2

comp_d5_s42 ✗ INCORRECT no_answer

Expected: 3qq33ppssvvllkk3aeau | Predicted:

comp_d5_s43 ✗ INCORRECT mismatch

Expected: oljndayttyadnjlooazdl | Predicted: oljndayttyadnjl ooazdl

comp_d5_s44 ✗ INCORRECT no_answer

Expected: sxx33cc3smh55s | Predicted:

comp_d5_s46 ✗ INCORRECT no_answer

Expected: pnbkslk13l31klsk | Predicted:

comp_d5_s47 ✗ INCORRECT no_answer

Expected: huzskvpdhzn3nzhdpvks | Predicted:

comp_d5_s48 ✗ INCORRECT mismatch

Expected: eehhqqmvhpphvmqqhheeoma | Predicted: eehhqqmvhpphvmqqhhheeoma

comp_d5_s49 ✗ INCORRECT no_answer

Expected: mzp11ddmmnn44ppqqqqsswwxx | Predicted:

comp_d5_s50 ✗ INCORRECT mismatch

Expected: iptkxjeinyn | Predicted: pitkxjkxjeinyn

comp_d5_s51 ✗ INCORRECT mismatch

Expected: ccddeeeeffhiijppvvxxxx | Predicted: ccccdeeeeffhiiijjppvvxx

comp_d5_s53 ✗ INCORRECT mismatch

Expected: bbbbbbccddgglnnrruuzzuurrnnlggddccbbbbbbcu | Predicted: bbbcclddgnnruuuzzuuurnngddlcbbbcuc

comp_d5_s54 ✗ INCORRECT mismatch

Expected: bvtnsvynjwuuwjnyvsntvb | Predicted: bvtnsvynjwuwjnyvntsvb

comp_d5_s55 ✗ INCORRECT mismatch

Expected: efjjlpwx | Predicted: efjjlpxw

comp_d5_s56 ✗ INCORRECT mismatch

Expected: oobbiikkllttrrccddttttddccrrttllkkiibboo | Predicted: oobbiikkllttrrccddtttddccrrttllkkiiibboo

comp_d5_s57 ✗ INCORRECT no_answer

Expected: bc2mrvzzvrm2cb | Predicted:

comp_d5_s58 ✗ INCORRECT no_answer

Expected: fvkbzvshsphkpc | Predicted:

comp_d5_s59 ✗ INCORRECT mismatch

Expected: bcdmqr | Predicted: bcdmrdq

comp_d5_s6 ✗ INCORRECT mismatch

Expected: ydrsm5wl1mjl2gjex | Predicted: ydrsm5511mj2gjex

comp_d5_s60 ✗ INCORRECT no_answer

Expected: 1dl4tvwz | Predicted:

comp_d5_s62 ✗ INCORRECT mismatch

Expected: 113344jjmmqqrrrrrrzzzzoeb | Predicted: 11114444jjmmqqrrrrrrzzzzzoeboeb

comp_d5_s65 ✗ INCORRECT no_answer

Expected: 3pjlwfgh45qqzbbzqq54hgfw | Predicted:

comp_d5_s66 ✗ INCORRECT mismatch

Expected: faaeellqqssttwwloolwwttssqqlleeaaf | Predicted: faabbceeeelllqssstwwloolwwstsssqqllleeeecbbaf

comp_d5_s67 ✗ INCORRECT no_answer

Expected: rhl3vdv3lhr | Predicted:

comp_d5_s68 ✗ INCORRECT mismatch

Expected: erggeeggmmoovvvvoommggee | Predicted: erggeeggmmvvoooommvvgggee

comp_d5_s69 ✗ INCORRECT no_answer

Expected: qqqqbbbbssssuuuubbbbddddgggguuuuuuuuggggddddbbbbuuuussssbbbbqqqq | Predicted:

comp_d5_s70 ✗ INCORRECT mismatch

Expected: nnddrrppiippkkttggkkddeeeeddkkggttkkppiipprrddnnggwwggdd | Predicted: nnddrripppiipppkkttggkkddeeeeddkggkkttkkiipprRDDnnddggwwggdd

comp_d5_s72 ✗ INCORRECT mismatch

Expected: arvvyzzyvvra | Predicted: aervvyzzvyvrea

comp_d5_s73 ✗ INCORRECT mismatch

Expected: aglnrxyyxrnlga | Predicted: aglrx y yxrlga

comp_d5_s74 ✗ INCORRECT mismatch

Expected: aaekkkkmmqqttvvwwwwwyy | Predicted: aaeikikkkkmmqqtqvttwwvw

comp_d5_s75 ✗ INCORRECT mismatch

Expected: abbddkkqqqquubbkkssoolla | Predicted: abbddddkkuu bbkkssoolll

comp_d5_s76 ✗ INCORRECT no_answer

Expected: xxzzzzzzzzffffffffsssssssseeeeeeeekkkkkkkkffyyyyqqxxxxxx | Predicted:

comp_d5_s77 ✗ INCORRECT mismatch

Expected: tjlhkkprzzrpkkhhkkprzzrpkkhhkkprzzrpkkhhkkprzzrpkkh | Predicted: tjlhhkkprzzprkkhhhkkprzzprkkhhhkkprzzprkkhhhkkprzzprkkhh

comp_d5_s8 ✗ INCORRECT mismatch

Expected: jmrvxq1dybcu | Predicted: jmrxxq1dybcu

comp_d5_s80 ✗ INCORRECT mismatch

Expected: llggiieeeemmiiccss | Predicted: llggiieeemmiicc ss

comp_d5_s81 ✗ INCORRECT no_answer

Expected: 22llmm44qqttxxyyptg | Predicted:

comp_d5_s82 ✗ INCORRECT mismatch

Expected: slzyb2mm2byzls | Predicted: slzyb2mm2bzyls

comp_d5_s84 ✗ INCORRECT mismatch

Expected: eehhuuttxxppjjccjjfflluuhhtttthhuullffjjccjjppxx | Predicted: eehhhuuttxxppjjccjjfflluuhhhtthhuullffjjccjjppxx

comp_d5_s85 ✗ INCORRECT no_answer

Expected: qqccnnqqqqssssyyyynnnnqqqqccccmmmmccccuuuurrrr | Predicted:

comp_d5_s86 ✗ INCORRECT mismatch

Expected: aaccjjllmmoopprttttuux | Predicted: aacccjjllmmoppprrtttttuuux

comp_d5_s87 ✗ INCORRECT mismatch

Expected: ppllrrccccffffggggiiiivvvvxxxxzzzz | Predicted: ppllrrccffffggggggggiivvxxzzzz

comp_d5_s89 ✗ INCORRECT mismatch

Expected: aaeeffgghhmmmmooqqvvvvwwwwzz | Predicted: aaeeffgghhmmmmooovvwwqqwwzz

comp_d5_s90 ✗ INCORRECT no_answer

Expected: b1qszwtp4wphv | Predicted:

comp_d5_s91 ✗ INCORRECT no_answer

Expected: 114444bbddggggqqqqxx | Predicted:

comp_d5_s92 ✗ INCORRECT no_answer

Expected: rxxwwssmmssmmllppr | Predicted:

comp_d5_s93 ✗ INCORRECT mismatch

Expected: ddddffffhhhhpppprrrrtttt | Predicted: ddddddddfffffhhhhpppprrrrtttt

comp_d5_s94 ✗ INCORRECT no_answer

Expected: oz1111bbbb2222jjjjllllvvvvwwwwxxxx | Predicted:

comp_d5_s95 ✗ INCORRECT no_answer

Expected: celnoprstuwyz | Predicted:

comp_d5_s96 ✗ INCORRECT mismatch

Expected: idwhkmstvzlntcaeggeactnlzvtsmkh | Predicted: idwhkmstvzlntcaeggaeclntzlvtsmkh

comp_d5_s97 ✗ INCORRECT no_answer

Expected: xynrylusnsulyrnyxukkuxynrylusnsulyrnyx | Predicted:

comp_d5_s99 ✗ INCORRECT mismatch

Expected: tdzhiqni | Predicted: tdzhiqn

Report generated: 2026-01-15 22:34:10